MultiScale-CNN-4mCPred: a multi-scale CNN and adaptive embedding-based method for mouse genome DNA N4-methylcytosine prediction

N4-methylcytosine (4mC) is an important epigenetic mechanism, which regulates many cellular processes such as cell differentiation and gene expression. The knowledge about the 4mC sites is a key foundation to exploring its roles. Due to the limitation of techniques, precise detection of 4mC is still a challenging task. In this paper, we presented a multi-scale convolution neural network (CNN) and adaptive embedding-based computational method for predicting 4mC sites in mouse genome, which was referred to as MultiScale-CNN-4mCPred. The MultiScale-CNN-4mCPred used adaptive embedding to encode nucleotides, and then utilized multi-scale CNNs as well as long short-term memory to extract more in-depth local properties and contextual semantics in the sequences. The MultiScale-CNN-4mCPred is an end-to-end learning method, which requires no sophisticated feature design. The MultiScale-CNN-4mCPred reached an accuracy of 81.66% in the 10-fold cross-validation, and an accuracy of 84.69% in the independent test, outperforming state-of-the-art methods. We implemented the proposed method into a user-friendly web application which is freely available at: http://www.biolscience.cn/MultiScale-CNN-4mCPred/.


Background
DNA modifications play a crucial role in some biological processes and diseases, including development [1], aging [2], cancer [3], and regulation of gene expression [4]. Over 17 types of chemical modifications have been identified in DNA so far [5], including 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), 5-carboxylcytosine (5caC), and N6-methyladenine (6mA) [6,7]. DNA methylation is an important epigenetic modification, which is catalyzed by a family of DNA methyltransferases. DNA methylation is involved in heavy metal modification and regulation of chromatin as well as gene expression [8]. Methylation has widely been found in *Correspondence: guohuahhn@163.com prokaryotic and eukaryotic genomes, including N4-methylcytosine (4mC) [9], 5-methylcytosine (5mC) [10], and N6-methyladenine (6mA) [11,12]. The 5mC is formed by transferring a methyl group from S-adenyl methionine to the fifth carbon of a cytosine residue [13] and has been widely demonstrated to play an important role in the biological progression associated with diabetes, cancer, and some neurological diseases [14][15][16]. The well-known DNA 6mA is a process of adding a methyl group to the 6-th position of an adenine ring catalyzed by DNA methyltransferases [17]. The 6mA is an essential epigenetic modification. The 4mC formation is catalyzed by N4-cytosine-specific DNA methyltransferases, which methylate the amino group at the fourth position of cytosine [18]. The 4mC is involved in such biological functions as host defense, transcription regulation, gene expression, and DNA replication [19]. In comparison with the 5mC and the 6mA, little was known about the 4mC modification and functions.
Detecting 4mC sites is critical to exploring 4mC functions. Many experimental methods have been developed to detect 4mC sites over the past decade, including single-molecule real-time sequencing (SMRT) and bisulfite sequencing [20]. These experimental methods generally are labor-intensive and time-consuming. The emergence of bioinformatics makes it possible to identify 4mC sites on a large scale. Hundreds of computational tools have been developed over the past thirty years for predicting post-translational modification [17,[21][22][23], RNA modification [24][25][26], single-cell analysis [27], protein functions [28,29], as well as protein structure [30], gene selection [31], cancer diagnosis [32], and even food recommendation [33]. With the help of computers of powerful computational ability, the computational methods made great progress in the protein structure recognition, which is thought as one of best challenging tasks. AlphaFold [30] is approaching accuracy of the experimentally determining protein structure. The vast achievement attributed to design of machine learning algorithm, which refined high level representation of protein structure.
Although vast efforts were paid to promote the predictive accuracy of identifying 4mC sites, there are some drawbacks to overcome. Representations of sequence are critical in the feature engineering-based methods. Most representations are degenerative mapping. That's to say, during the course of converting sequences into representation, there would be information lost. Ensemble of more representations would make the model improve predictive accuracy, but added the computational costs. Most representations are based on character statistics, disregarding the relationship between characters, namely contextual semantics. The deep learning-based methods are very promising to promote significantly predictive accuracy of 4mC recognitions. CNNs are good at characterizing local property, while the GRU and the long short-term memory (LSTM) do well in capturing contextual semantics. Different scale CNN reflects different scale information of sequences. The 4mCPred-CNN [38] failed to exploit multi-scale information.
To overcome the drawbacks above we proposed a deep neural network-based computational method named MultiScale-CNN-4mCPred for mouse 4mC site prediction. The MultiScale-CNN-4mCPred consisted of adaptive embedding, bidirectional LSTM (Bi-LSTM) followed by three scale CNNs, and fully-connected network. DNA sequences resemble sentences in the natural language, and thus there would be semantics hidden in it. The LSTM was intended to extract the semantics of 4mC DNA sequences, and the multi-scale CNN was intended to extract sequence representation at a different scale. The highlight of the MultiScale-CNN-4mCPred was summarized as follows. Firstly, the MultiScale-CNN-4mCPred combined the multi-scale CNNs and Bi-LSTM, and used them to capture scale representations and semantics respectively, which promoted predictive accuracy. The 4mCPred-CNN [38] didn't use LSTM or GRU to capture semantics, while the Mouse4mC-BGRU [39] didn't exploit CNNs to characterize local property. Secondly, the MultiScale-CNN-4mCPred used the adaptive embedding which is dynamic to represent sequence with comparison to one-hot encoding.

Optimizing combination of different scale CNNs
Multiscale refers to signal sampling at different scales, and different characteristics can be observed at different scales. Larger scale characterizes global information, while the smaller scale reflects local characteristics. Therefore, the CNNs with different scales are of capacity to capture different information. We investigated the effect of the combination of different scale CNNs on the performances. We randomly selected 80% samples of the training dataset as the training samples and the remaining 20% as the validation set. Table 1 showed the performances of the combination of different scale CNNs. It was obviously observed that the combination of CNNs with the kernel sizes 3, 5, and 7 substantially outperformed others, indicating that the combination of these three scales can characterize well global, medium, and local structure. Therefore, we chose the above three scales for later experiments.

Comparison with different embedding
For text sequences, there are many frequently used embedding to represent it such as one-hot encoding and Word2Vec [42,43]. One-hot encoding is well-known encoding in the natural language processing, where a word corresponds a 0/1 binary vector. Onehot encoding is unable to capture textual semantic relationship between words. Word-2vec translated each word into a contiguous vector which contains intrinsic semantics in the sequence. We compared the adaptive embedding with two types of embedding. In the Word2Vec, we considered 1, 2, and 3 contiguous nucleotides as a word respectively. Except for the embedding, the other parts of model are identical. As shown in Table 2, the overall performance of the adaptive embedding over the validation set was the best, with three indices (Sn, Acc, and MCC) to be in the lead. This is a reason to choose the adaptive embedding, not One-hot or Word2Vec.

Comparison with state-of-the-art methods
As mentioned previously, many computational methods have been developed to predict DNA N4-methylcytosine in mouse genome over the recent ten years, such as Mouse4mC-BGRU [39], i4mC-Mouse [37], and 4mCPred-EL [36]. We performed the 10-fold cross-validation and the independent test to compare them. In the 10-fold crossvalidation, the training set was randomly divided into ten parts in equal or approximate size. Nine parts were used for training the model, and the remaining one part for testing the trained model. This process was repeated ten times. In the independent test, the whole training set was used for training the model, and the testing set was used for testing the trained model. As shown in Table 3, Our MultiScale-CNN-4mCPred reached competitive performances with these state-of-the-art methods over the 10-fold crossvalidation. For example, the MultiScale-CNN-4mCPred obtained the best Acc, the second-best MCC as well as Sn, and the third-best Sp. In addition, our method was more tradeoff between Sn and Sp than other methods, i.e., both Sn and Sp exceeded 0.8. Table 4 listed the performances over the independent test. Except for Sp, all indices of the MultiScale-CNN-4mCPred were best. The MultiScale-CNN-4mCPred increased Sn by 0.0563 over the Mouse4mC-BGRU [39], by 0.0492 over the i4mC-Mouse [37], and by 0.0991 over the 4mCpred-EL [36]. The MultiScale-CNN-4mCPred increased Acc by 0.0219 over the Mouse4mC-BGRU [39], by 0.0308 over the i4mC-Mouse [37], and by 0.0559 over the 4mCpred-EL [36]. The MultiScale-CNN-4mCPred outperformed Mouse4mC-BGRU [39] by 0.0429 MCC, the i4mC-Mouse [37] by 0.0609 MCC, and the 4mCpred-EL [36] by 0.1099 MCC. Therefore, according to the performances over both cross-validation and the independent test, the MultiScale-CNN-4mCPred is superior to these three methods. We compared MultiScale-CNN-4mCPred with the newly developed method 4mCPred-CNN [38] which is implemented into a user-friendly webserver: http:// nsclb io. jbnu. ac. kr/ tools/ 4mCPr ed-CNN/. We uploaded the independent set to the webserver of the 4mCPred-CNN [38] for conducting prediction. We counted the predicted results under various thresholds, as listed in Table 5. As a comparison, we also listed performance of MultiScale-CNN-4mCPred under various thresholds. The MultiScale-CNN-4mCPred was superior to the 4mCPred-CNN [38] in value of Acc and MCC. These results indicated that MultiScale-CNN-4mCPred is a competitive and advanced computational method for predicting DNA 4mC sites.

Contributions of CNNs to 4mC prediction
We also investigated the contributions of CNNs with different scales to 4mC prediction.   Sp decrease. Table 7 showed the performance of the MultiScale-CNN-4mCPred with the single-scale CNN. Different scale CNNs contributes differently. The CNN with the size 3 contributed mainly to Sn, the size 5 CNN to Sp, and the size 7 CNN to Sp and then to Sn. Therefore, the multi-scale CNNs contributed jointly to 4mC prediction, which is a reason to use it.

Generalized ability
We also tested the proposed method for the ability to predict 4mC sites in other species. We downloaded 6 training and 6 independent datasets respectively from six species: A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus at http:// thegl eelab. org/ Meta-4mCpr ed/ 4mCPr edData. html [34]. The numbers of samples in the training and independent datasets were 3956 and 2500 in A. thaliana, 3108 and 1500 in C. elegans, 3538 and 2000 in D. melanogaster, 776 and 268 in E. coli, 1138 and 400 in G. pickeringii, and 1812 and 700 in G. subterraneus. We used the training dataset to train MultiScale-CNN-4mCPred, and used the independent dataset from the same species to test the trained model. Table 8 listed performances over independent datasets. In terms of MCC, the MultiScale-CNN-4mCPred performed better than both DeepTorrent [44] and 4mCPred [26] over three species: A. thaliana, C. elegans and

Conclusions
The 4mC is one of the epigenetic mechanisms. Precise and large-scale detecting 4mC sites is still challenging. We presented a multi-scale CNNs and adaptive embeddingbased method called MultiScale-CNN-4mCPred for predicting 4mC sites in the mouse genomes. The MultiScale-CNN-4mCPred reached the state-of-the-art performance in the mouse genome, and showed curtain ability to predict DNA 4mC across species. This method is end-to-end without manual feature extraction, and is simple to realize. Different scales reflect different characterizations of 4mC. The MultiScale-CNN-4mCPred failed to discover the biological meaning of the different scale characterization, and textual semantic of single nucleotide residue. We also developed a user-friendly web application which is convenient to predict 4mC sites. The proposed method and the developed tool are beneficial to epigenetic research. In the future, we shall employ the Transformer [45] or the BERT [46] to improve the semantical representation of sequences, and exploit attention mechanism to promote interpretability of models.

Methods
As shown in Fig. 1, the presented MultiScale-CNN-4mCPred is a deep neural networkbased method, consisting mainly of adaptive embedding [39], bi-directional long shortterm memory (Bi-LSTM) [47], multi-scale CNNs [48,49], and three fully-connected layers. The DNA letter sequences were first transformed into sequences of integers. Then, the adaptive embedding layer was used to map sequences of integers into continuous vectors. The Bi-LSTM layer was used to extract contextual semantics from DNA sequences. The multi-scale CNN layer was used to extract local features through multiple kernels of different scales. To reduce or avoid overfitting, the dropout was attached at the end of the multi-scale CNN layer and the first fully-connected layer, respectively. The multi-scale representations were concatenated to be fed into three fully-connected layers. The final output represented probabilities of predicting inputs as 4mC. The MultiScale-CNN-4mCPred is an end-to-end learning model. That's to say, the DNA sequence were fed into the Multi-Scale-CNN-4mCPred and the predictive results were directly returned without any manual operations. The parameters in each layer of the MultiScale-CNN-4mCPred were listed in Table 9. The total number of trainable parameters was 29661.  Table 9 Number of parameters and shape of output in the proposed

Character encoding
The DNA sequence consisted of four characters: A, T, G, and C. The deep neural network cannot directly process the character sequences. Therefore, the character must be converted into numerical sequences. Here, we defined a map as follows:

Adaptive embedding
Embedding is a way to transform discrete variables into continuous representations [50]. In neural networks, the embedding technique not only reduces the spatial dimension of discrete variables, but also meaningfully characterizes the variable. For example, the word2vec [42,43] is a frequently used embedding algorithm in the field of natural language processing, where an interesting and famous example is that vector (king) − vector (man) + vector (women) ≈ vector (queen). On the contrary, the traditional one-hot encoding method translated each word into a binary vector, which has two limitations. Firstly, when the vocabulary was large, the vector generated by one-hot might be very sparse. Secondly, the one-hot encoding assumed that words were independent of each other. Thus, similarities between words could not be measured. We used adaptive embedding [39] as the first layer of the MultiScale-CNN-4mCPred.

CNN
CNN is a well-known neural network architecture [48]. CNN first appeared in the LeNet-5 which was a multi-layer neural network [51,52] and was used to distinguish handwritten digits. The LeNet-5 used the backpropagation algorithm for training. The CNN attracted considerable attention for the AlexNet [53], a deep CNN neural network architecture that reduced significantly the error rate of image classification in comparison with other state-of-the-art methods. Since then, many variants of deep CNN architecture have been proposed, such as ZFNet [54], VGGNet [55], GoogleNet [56], and ResNet [57]. It has widely been demonstrated that these variants are of effectiveness and efficiency in most cases. For example, the ResNet [57] won the champion of ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. CNN is becoming one of the most popular and practically used components in the field of deep learning.
The CNN contained three basic components: convolution, pooling, and activations [58]. The aim of the convolution is to compute a feature map. Each neuron in the feature map is linked to a region of neighboring neurons in the previous layer which is generally called the receptive field. The kernel (also called filters) convolves with receptive fields to generate a feature map. The receptive fields slide at a certain stride in horizontal or vertical directions to generate various convolved results. The kernel is learnable and shared by all receptive fields. Various kernels yielded various feature maps. In most cases, multiple kernels are used to characterize the studied object from multiple perspectives. The convolved results were non-linearly transformed by the activation function. The commonly used activation functions include Sigmoid [59], Tanh [60], and ReLU [61], The pooling is to reduce the dimension of the feature map, and thus reduced the complexity of training deep neural networks. The pooling included max pooling, min pooling, and mean pooling.

Bi-directional long short-term memory
Recurrent neural network (RNN) is a different architecture of neural network from the CNN. At the time step t in the RNN, the state of the hidden node h (t) was not only associated with the current input x (t) , but also with the hidden state h (t−1) at the previous time step t-1, which was computed by where W hx denoted the linking weight between the input and the hidden state, and W hh was the linking weight between the hidden states. The output at the time step t was computed by where W yh denoted linking weight between output and hidden state, b y denotes bias. The marked trait of the RNN was that all the nodes shared linking weights. The RNN is especially suitable for sequence analysis. Although RNN was capable of solving the order problem of neural network in time-based sequence, the RNN still had some limitations [62,63]. The major limitation was to vanish and to explode for the gradient. The LSTM [64] solved well this limitation of the early RNN by replacing the traditional nodes of the hidden layer with a memory cell which was a unit of computation [65].
The common long short-term memory (LSTM) cell contained input, the previous hidden state, the previous cell state, the current hidden state, the current cell state, a candidate cell state, and three gates: forget gate, input gate, and output gate. The cell state was equivalent to the path of information transmission, which was the horizontal line with only some minor linear interactions, running straight down the entire chain, so that the previous information could flow along the sequence unchanged. The cell state was viewed as the "memory" of the neural network [62]. In theory, the cell state could transmit the relevant information forever. The addition and the removal of information in the sequence were controlled by the gates. The forget gate determined what information was preserved from the cell state. The forget gate outputted a number between 0 and 1, 0 representing completely getting rid of the information and 1 completely keeping it [62]. The input gate decided which value was updated. The new candidate cell state was created by the tanh function. The cell state was updated by multiplying the previous cell state by the forget gate, and by multiplying the candidate cell state by the input gate. The hidden state was updated by multiplying the output gate by tanh of the cell state. All the computations in each LSTM cell were listed as follows: In equations above, W fx and W fh denoted the linking weight between the input and the forget gate as well as the linking weight between the forget gate and the hidden state. W ix , W ih , W gx and W gh denoted the linking weights between the input and the input gate, between the input gate and the hidden state, between the candidate cell and the input, as well as between the candidate cell and the hidden state, respectively. Besides, W ox and W oh denoted the linking weight between the input and the output gate, as well as between output gate and hidden state, respectively. The symbols σ and ∅ represented the activation function sigmoid and tanh, respectively, and the symbol ⊙ represented the point by point multiplication of vectors.
The LSTM allowed the information to flow from the past to the future. In some cases, the output was not only related to the previous input, but also might be related to the future input. A Bi-LSTM proposed by Schuster et al. [47] addressed well the issue.

Dropout and flatten layer
The dropout was pioneered by Hinton et al. [66]. The dropout is that in the training stage, a certain proportion of the neurons were dropped out and parameters of only preserved neurons were updated. The aim of designing dropout have two-fold: avoiding overfitting and reducing the complexity of computation. Due to its effectiveness and efficiency, the dropout is increasingly becoming a frequently used trick in the deep learning area [67][68][69].
The flatten layer is to play the role of a bridge, which links the fully-connected layer to the non-fully connected layer. The flatten layer transformed the output of non-fully connected layer into one dimension, so that it could link to the subsequent fully-connected layer. The flatten layer has no trainable parameters. The fully-connected layer was similar to the hidden layer in the multilayer perceptron, where each neuron was connected to all the neurons in the preceding layer.

Datasets
For a fair comparison with the state-of-the-art methods, we used the same dataset as Mouse4mC-BGRU [39], 4mCPred-CNN [38], i4mC-Mouse [37], and 4mCPred-EL [36]. The dataset originated from the MethSMRT [70], which was the first database to deposit DNA 6mA and 4mC sites. The process of MethSMRT [70] collecting the dataset was (5) briefly described as follows. The MethSMRT [70] firstly retrieved the SMRT sequencing data from the NCBI Gene Expression Omnibus (http:// www. ncbi. nlm. nih. gov/ geo/) [71] and Sequence Read Archive (http:// www. ncbi. nlm. nih. gov/ sra) [72] and then employed PacBio SMRT analysis platform which is available at http:// www. pacb. com/ produ cts-and-servi ces/ analy tical-softw are/ smrt-analy sis/ analy sis-appli catio ns/ epige netics/ to detect 6mA and 4mC site. Reads with less than 50 nucleotides (nt), or a low-quality region (read score < 0.75 by default) were filtered out by using SFilter. The reads left were aligned to the reference genome by pbalign. The sites with less than 25-fold coverage per strand or with less than 20 modification scores were removed. The MethSMRT [70] hosted methylations of 156 species, including 7 eukaryotes and 149 prokaryotes. The DNA sequences were divided into windows of 41 base pairs with the cytosine at the center of it. The windows containing the 4mC sites were considered as positive samples and others were as negative ones. To reduce or remove the influence of sequence homology on the methods, the CD-HIT programming [73] was used to cluster the original sequences. The sequence identity was set to 0.7. Therefore, among the preserved sequences, the sequence identity between any two was less than 0.7. The same number of negative samples as the positive samples were randomly selected to a keep balance between them. All the samples were divided into the training set and the testing set. The training set consisted of 746 positive and 746 negative samples, while the testing set consisted of 160 positive and 160 negative samples.

Evaluation metrics
We adopted Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and Matthews correlation coefficient (MCC) for evaluating the performances, which were respectively defined as follows: where TP represented the number of the true 4mC samples that were correctly predicted to be 4mC, TN the number of non-4mC samples that were correctly predicted to be non-4mC, FP the number of non-4mC samples that were incorrectly predicted to be 4mC, and FN the number of 4mC samples that were incorrectly predicted to be non-4mC.
The MultiScale-CNN-4mCPred was implemented by Tensorflow 2.9.1 framework, and all scripting programs were written via Python 3.8.12. The entire project was run on a Windows 10 system configured with 2.42 GHz CPU, 2 GB GPU, and 16 GB RAM. The learnable parameters in MultiScale-CNN-4mCPred were randomly initiated. 20% of the training set is randomly sampled as the validation set.

Web server
For the purpose of conveniently using the MultiScale-CNN-4mCPred for 4mC site prediction, we implemented the proposed method into a user-friendly web server, which is freely accessible at http:// www. biols cience. cn/ Multi Scale-CNN-4mCPr ed/. The web server is very easy to use. Users paste or upload DNA sequences in the FASTA format, and then click the "submit" button. The web server will perform predictions and return the predicted results to users.